4a¶
GSE84133 is a single-cell RNA-seq dataset of human pancreatic islets from multiple donors, processed using the 10X Genomics platform. The raw count matrices were downloaded from GEO. Non-gene annotation rows were removed, and sample metadata (like barcode and cluster assignments) was separated to retain only numeric gene expression data. QC metrics (total counts, number of genes, and mitochondrial gene percentage) were calculated to assess cell quality. Cells with low gene counts and genes expressed in very few cells were filtered out to remove noise. The data was then normalized to 10,000 counts per cell to account for sequencing depth differences and log-transformed to stabilize variance across genes. All samples were concatenated into a single dataset. Highly variable genes were identified to focus the analysis on the most informative genes. The data was scaled to zero mean and unit variance, and batch effects between samples were corrected to avoid technical biases. Principal Component Analysis (PCA) was performed for dimensionality reduction. A neighborhood graph was computed based on the PCA space to capture the structure of the data. Cells were clustered using the Leiden algorithm, and marker genes were identified for each cluster. and based on the expression of known marker genes, biological cell type labels were assigned to the clusters.¶
4b¶
Step | Key Tools Used |¶
Download raw count matrices | wget, |¶
Remove non-gene annotations | pandas |¶
Separate metadata | pandas |¶
Calculate QC metrics | Scanpy |¶
Filter low-quality cells/genes | Scanpy |¶
Normalize counts | Scanpy |¶
Log-transform normalized data | Scanpy |¶
Concatenate samples | Scanpy, AnnData |¶
Identify highly variable genes (HVGs) | Scanpy |¶
Scale data (zero mean, unit variance) | Scanpy |¶
Batch correction | Harmony |¶
Perform PCA | Scanpy |¶
Compute neighborhood graph | Scanpy |¶
Cluster cells (Leiden) | Scanpy |¶
Find marker genes | Scanpy |¶
Assign cell type annotations | Scanpy, known markers |¶
In [ ]:
# Install packages
In [161]:
import os
import requests
import pandas as pd
import numpy as np
import scanpy as sc
import subprocess
import anndata as ad
import matplotlib.pyplot as plt
import seaborn as sns
import harmonypy as hm
import warnings
import plotly.express as px
np.random.seed(42)
In [105]:
os.makedirs("data", exist_ok=True)
gsm_ids = {
"GSM2230757": "GSM2230757_human1_umifm_counts.csv.gz",
"GSM2230758": "GSM2230758_human2_umifm_counts.csv.gz",
"GSM2230759": "GSM2230759_human3_umifm_counts.csv.gz",
"GSM2230760": "GSM2230760_human4_umifm_counts.csv.gz"
}
for gsm_id, filename in gsm_ids.items():
subdir = gsm_id[:7] + "nnn"
url = f"https://ftp.ncbi.nlm.nih.gov/geo/samples/{subdir}/{gsm_id}/suppl/{filename}"
output_path = os.path.join("data", filename)
if os.path.isfile(output_path):
print(f"{filename} already exists, skipping.")
continue
print(f" Downloading {filename}...")
try:
subprocess.run(["wget", "-O", output_path, url], check=True)
print(f"Downloaded {filename}")
except subprocess.CalledProcessError as e:
print(f"Failed to download {filename}: {e}")
GSM2230757_human1_umifm_counts.csv.gz already exists, skipping. GSM2230758_human2_umifm_counts.csv.gz already exists, skipping. GSM2230759_human3_umifm_counts.csv.gz already exists, skipping. GSM2230760_human4_umifm_counts.csv.gz already exists, skipping.
In [107]:
dataframes = {}
for gsm_id, filename in gsm_ids.items():
file_path = os.path.join("data", filename)
try:
df = pd.read_csv(file_path, index_col=0)
dataframes[gsm_id] = df
print(f" Loaded {filename} with shape {df.shape}")
except Exception as e:
print(f" Error loading {filename}: {e}")
Loaded GSM2230757_human1_umifm_counts.csv.gz with shape (1937, 20127) Loaded GSM2230758_human2_umifm_counts.csv.gz with shape (1724, 20127) Loaded GSM2230759_human3_umifm_counts.csv.gz with shape (3605, 20127) Loaded GSM2230760_human4_umifm_counts.csv.gz with shape (1303, 20127)
In [109]:
# Initialize dictionaries to store the columns separately
barcode_cluster_pk_dataframes = {}
# Loop through each GSM DataFrame in the dictionary
for gsm_id, df in dataframes.items():
# Separate the barcode, assigned_cluster, and pk columns
barcode_cluster_pk_data = df[['barcode', 'assigned_cluster', 'pk']]
# Store this separated data in a new dictionary
barcode_cluster_pk_dataframes[gsm_id] = barcode_cluster_pk_data
# Remove the columns 'barcode', 'assigned_cluster', and 'pk' from the original dataframe
df.drop(columns=['barcode', 'assigned_cluster', 'pk'], inplace=True)
# Print shape of updated DataFrame
print(f"{gsm_id}: {df.shape[0]} rows and {df.shape[1]} columns after removing specified columns.")
# The `barcode_cluster_pk_dataframes` dictionary now contains the separated columns for each GSM.
GSM2230757: 1937 rows and 20124 columns after removing specified columns. GSM2230758: 1724 rows and 20124 columns after removing specified columns. GSM2230759: 3605 rows and 20124 columns after removing specified columns. GSM2230760: 1303 rows and 20124 columns after removing specified columns.
QC Metrics¶
In [111]:
adata_dict = {}
# Convert dataframes into AnnData objects and store them in adata_dict
for gsm_id, df in dataframes.items():
adata = sc.AnnData(df)
adata_dict[gsm_id] = adata
# Calculate QC metrics for each sample
for gsm_id, adata in adata_dict.items():
# Identify mitochondrial genes (assuming "MT-" prefix for human data)
mito_genes = adata.var_names.str.startswith('MT')
# Add percentage of mitochondrial counts to the adata.obs
adata.obs['pct_counts_mt'] = adata[:, mito_genes].X.sum(axis=1) / adata.X.sum(axis=1) * 100
# Calculate other QC metrics (total counts, number of genes, etc.)
sc.pp.calculate_qc_metrics(adata, inplace=True)
print(f"Calculated QC metrics for {gsm_id}")
# Check the available GSM IDs in the data
print("Available GSM IDs in the data:", adata_dict.keys())
# Print the QC metrics for each sample
for gsm_id, adata in adata_dict.items():
print(f"\nQC Metrics for {gsm_id}:")
print(f"Total counts: {adata.obs['total_counts'].mean():.2f}")
print(f"Number of genes: {adata.obs['n_genes_by_counts'].mean():.2f}")
print(f"Percentage of mitochondrial counts: {adata.obs['pct_counts_mt'].mean():.2f}%")
Calculated QC metrics for GSM2230757 Calculated QC metrics for GSM2230758 Calculated QC metrics for GSM2230759 Calculated QC metrics for GSM2230760 Available GSM IDs in the data: dict_keys(['GSM2230757', 'GSM2230758', 'GSM2230759', 'GSM2230760']) QC Metrics for GSM2230757: Total counts: 5805.05 Number of genes: 1923.72 Percentage of mitochondrial counts: 0.27% QC Metrics for GSM2230758: Total counts: 5109.00 Number of genes: 1893.88 Percentage of mitochondrial counts: 0.22% QC Metrics for GSM2230759: Total counts: 5858.42 Number of genes: 1750.07 Percentage of mitochondrial counts: 0.30% QC Metrics for GSM2230760: Total counts: 6729.19 Number of genes: 2202.82 Percentage of mitochondrial counts: 0.23%
In [113]:
for gsm_id, adata in adata_dict.items():
print(f"\nGenerating QC plots for {gsm_id}")
# Violin plots
sc.pl.violin(
adata,
['total_counts', 'n_genes_by_counts', 'pct_counts_mt'],
jitter=0.4,
multi_panel=True
)
plt.suptitle(f"{gsm_id} - Violin plots", fontsize=14)
plt.tight_layout()
plt.show()
Generating QC plots for GSM2230757
<Figure size 640x480 with 0 Axes>
Generating QC plots for GSM2230758
<Figure size 640x480 with 0 Axes>
Generating QC plots for GSM2230759
<Figure size 640x480 with 0 Axes>
Generating QC plots for GSM2230760
<Figure size 640x480 with 0 Axes>
In [275]:
print(adata_dict.keys())
dict_keys(['GSM2230757', 'GSM2230758', 'GSM2230759', 'GSM2230760'])
Filter out low-quality cells¶
Based on QC plots, I filtered cells expressing fewer than 200 genes, removed genes detected in fewer than 3 cells, and excluded cells with total UMI counts below 1,000 or above 20,000. Cells with >5% mitochondrial gene expression were removed to ensure high data quality for clustering and annotation.¶
In [115]:
# Thresholds
min_genes_per_cell = 200
min_cells_per_gene = 3
min_umi_per_cell = 1000
max_umi_per_cell = 20000
max_mito_percent = 5.0 # in %
# Create a new dictionary to hold filtered data
filtered_dataframes = {}
for gsm_id, df in dataframes.items():
print(f"\nFiltering {gsm_id}...")
# Calculate QC metrics
total_counts = df.sum(axis=1)
gene_counts = (df > 0).sum(axis=1)
# Identify mitochondrial genes
mito_genes = [gene for gene in df.columns if gene.startswith('MT')]
mito_counts = df[mito_genes].sum(axis=1)
mito_percent = (mito_counts / total_counts) * 100
# Apply all filters
cells_pass_filter = (
(gene_counts >= min_genes_per_cell) &
(total_counts >= min_umi_per_cell) &
(total_counts <= max_umi_per_cell) &
(mito_percent <= max_mito_percent)
)
df_filtered_cells = df[cells_pass_filter]
print(f" Cells retained after QC: {df_filtered_cells.shape[0]}")
# Filter genes (columns) expressed in fewer than min_cells_per_gene cells
genes_pass_filter = (df_filtered_cells > 0).sum(axis=0) >= min_cells_per_gene
df_filtered = df_filtered_cells.loc[:, genes_pass_filter]
print(f" Genes retained after QC: {df_filtered.shape[1]}")
# Store the filtered dataframe
filtered_dataframes[gsm_id] = df_filtered
# Now filter annotations to match
filtered_annotations = {}
for gsm_id in filtered_dataframes:
filtered_cells = filtered_dataframes[gsm_id].index
ann_df = barcode_cluster_pk_dataframes[gsm_id]
filtered_ann_df = ann_df.loc[filtered_cells]
filtered_annotations[gsm_id] = filtered_ann_df
print(f"{gsm_id}: Filtered annotations shape: {filtered_ann_df.shape}")
Filtering GSM2230757... Cells retained after QC: 1922 Genes retained after QC: 14713 Filtering GSM2230758... Cells retained after QC: 1724 Genes retained after QC: 14883 Filtering GSM2230759... Cells retained after QC: 3563 Genes retained after QC: 15202 Filtering GSM2230760... Cells retained after QC: 1290 Genes retained after QC: 14444 GSM2230757: Filtered annotations shape: (1922, 3) GSM2230758: Filtered annotations shape: (1724, 3) GSM2230759: Filtered annotations shape: (3563, 3) GSM2230760: Filtered annotations shape: (1290, 3)
In [117]:
normalized_data = {}
for gsm_id, df in filtered_dataframes.items():
# Create AnnData object
adata = ad.AnnData(df)
# Normalize: scale each cell's total counts to 10,000
sc.pp.normalize_total(adata, target_sum=1e4)
# Store normalized AnnData (not yet log-transformed)
normalized_data[gsm_id] = adata
In [119]:
print(f"Number of negative entries: {(adata.X < 0).sum()}")
Number of negative entries: 0
In [121]:
sc.pp.normalize_total(adata, target_sum=1e4)
print(f"Number of negative entries: {(adata.X < 0).sum()}")
Number of negative entries: 0
In [123]:
log_transformed_data = {}
for gsm_id, adata in normalized_data.items():
# Log1p transformation (no plots)
sc.pp.log1p(adata)
# Store log-transformed data
log_transformed_data[gsm_id] = adata
In [125]:
# Concatenate all log-transformed AnnData objects
combined_adata = sc.concat(list(log_transformed_data.values()),
label='batch',
keys=list(log_transformed_data.keys()),
index_unique='-')
# Check the combined object
print(combined_adata)
AnnData object with n_obs × n_vars = 8499 × 13774
obs: 'batch'
In [127]:
print(combined_adata.shape) # cells x genes
print(combined_adata.obs['batch'].value_counts()) # how many cells per batch
(8499, 13774) batch GSM2230759 3563 GSM2230757 1922 GSM2230758 1724 GSM2230760 1290 Name: count, dtype: int64
In [129]:
sc.pp.highly_variable_genes(combined_adata,
flavor='seurat',
n_top_genes=2000)
# Print summary
print(f"Number of highly variable genes selected: {combined_adata.var['highly_variable'].sum()}")
sc.pl.highly_variable_genes(combined_adata)
Number of highly variable genes selected: 2000
In [131]:
sc.pp.scale(combined_adata, zero_center=True, max_value=None)
# Check the scaled data
print(combined_adata)
AnnData object with n_obs × n_vars = 8499 × 13774
obs: 'batch'
var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg'
In [133]:
sc.tl.pca(combined_adata, svd_solver='arpack', random_state= 42 )
# Plot the PCA results (2D)
sc.pl.pca(combined_adata, color='batch')
# Show the explained variance ratio to understand the amount of variance captured by each principal component
print(combined_adata.uns['pca']['variance_ratio'])
[0.06281348 0.04058672 0.0271614 0.0235323 0.02264049 0.01322893 0.00852874 0.00778488 0.00736579 0.00519212 0.00502104 0.00461948 0.00422595 0.00370149 0.00320128 0.00308771 0.00270745 0.0026192 0.00260219 0.00252267 0.00238146 0.00231575 0.00230704 0.00217362 0.00211461 0.00203691 0.00202781 0.00197115 0.00186409 0.00178382 0.00176703 0.00171424 0.00167931 0.00162129 0.00161178 0.00160302 0.00153201 0.00150662 0.0014808 0.00146569 0.00144795 0.00143309 0.00140103 0.00139155 0.00137676 0.00135418 0.00133402 0.00132486 0.00132042 0.00131809]
In [135]:
sc.pp.neighbors(combined_adata, use_rep='X_pca', random_state= 42)
sc.tl.umap(combined_adata,random_state= 42)
# Save the UMAP coordinates before batch correction
combined_adata.obsm['X_umap_orig'] = combined_adata.obsm['X_umap'].copy()
In [137]:
# Run Harmony on the PCA space
ho = hm.run_harmony(
combined_adata.obsm['X_pca'], # PCA matrix
combined_adata.obs, # Metadata (the obs table)
vars_use=['batch'] # Column name(s) in obs that define batches
)
# Save the corrected PCA embeddings into a new slot
combined_adata.obsm['X_pca_harmony'] = ho.Z_corr.T # Harmony returns PCs x cells; transpose!
# Recompute neighbors on Harmony-corrected PCA
sc.pp.neighbors(combined_adata, use_rep='X_pca_harmony', random_state= 42) # <- very important!
# Now compute UMAP
sc.tl.umap(combined_adata, random_state= 42)
# Save the UMAP coordinates
combined_adata.obsm['X_umap_harmony'] = combined_adata.obsm['X_umap'].copy()
# Plot side-by-side comparison for before and after correctn
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
# Plot original UMAP (before Harmony)
sc.pl.embedding(
combined_adata,
basis='X_umap_orig',
color='batch',
title='Before Batch Correction',
ax=axs[0],
show=False
)
# Plot Harmony-corrected UMAP
sc.pl.embedding(
combined_adata,
basis='X_umap_harmony',
color='batch',
title='After Harmony Correction',
ax=axs[1],
show=False
)
plt.tight_layout()
plt.show()
2025-05-03 08:58:26,280 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans... 2025-05-03 08:58:27,049 - harmonypy - INFO - sklearn.KMeans initialization complete. 2025-05-03 08:58:27,066 - harmonypy - INFO - Iteration 1 of 10 2025-05-03 08:58:27,741 - harmonypy - INFO - Iteration 2 of 10 2025-05-03 08:58:28,395 - harmonypy - INFO - Iteration 3 of 10 2025-05-03 08:58:29,058 - harmonypy - INFO - Iteration 4 of 10 2025-05-03 08:58:29,700 - harmonypy - INFO - Iteration 5 of 10 2025-05-03 08:58:30,366 - harmonypy - INFO - Iteration 6 of 10 2025-05-03 08:58:30,993 - harmonypy - INFO - Iteration 7 of 10 2025-05-03 08:58:31,360 - harmonypy - INFO - Iteration 8 of 10 2025-05-03 08:58:31,741 - harmonypy - INFO - Converged after 8 iterations
In [141]:
sc.tl.leiden(combined_adata, resolution=0.5)
In [143]:
sc.pl.umap(combined_adata, color='leiden', title='Clusters after Harmony batch correction', legend_loc='on data')
sc.pp.neighbors(
combined_adata,
use_rep='X_pca_harmony' # use the Harmony-corrected PCs
)
In [145]:
# 1. Re-run UMAP asking for 3 components
sc.tl.umap(
combined_adata,
n_components=3
)
# 2. Plot 3D UMAP
sc.pl.umap(
combined_adata,
color=['batch', 'leiden'], # You can color by batch or cluster
title='3D UMAP after Harmony batch correction',
projection='3d'
)
WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.
In [147]:
# Define pancreas cell markers
marker_genes = {
'Alpha': ['GCG', 'ARX'],
'Beta': ['INS', 'MAFA', 'NKX6-1'],
'Delta': ['SST', 'HHEX'],
'PP': ['PPY'],
'Acinar': ['PRSS1', 'CPA1', 'AMY2A'],
'Ductal': ['KRT19', 'CFTR', 'SOX9'],
'Endothelial': ['PECAM1', 'VWF'],
'Macrophage': ['CD68', 'LYZ'],
'Stellate': ['COL1A1', 'PDGFRB']
}
In [75]:
sns.set(style="whitegrid")
valid_genes = {cell_type: [gene for gene in genes if gene in combined_adata.var_names]
for cell_type, genes in marker_genes.items()}
for cell_type, genes in valid_genes.items():
if genes:
sc.tl.score_genes(combined_adata, gene_list=genes, score_name=f'{cell_type}_score', use_raw=False)
score_cols = [f'{ct}_score' for ct in valid_genes.keys()]
combined_adata.obs['cell_type'] = combined_adata.obs[score_cols].idxmax(axis=1)
combined_adata.obs['cell_type'] = combined_adata.obs['cell_type'].str.replace('_score', '')
if 'X_umap' not in combined_adata.obsm:
print("UMAP embeddings not found.")
else:
print("UMAP embeddings found.")
sc.pl.umap(combined_adata,
color='cell_type',
palette='tab10',
title='UMAP with Cell Types',
legend_loc=None,
show=False)
plt.grid(True, linestyle='-', linewidth=0.5, alpha=0.6)
plt.title('UMAP with Cell Types')
for i, cell_type in enumerate(combined_adata.obs['cell_type'].unique()):
subset = combined_adata[combined_adata.obs['cell_type'] == cell_type]
plt.text(subset.obsm['X_umap'][:, 0].mean(),
subset.obsm['X_umap'][:, 1].mean(),
cell_type,
horizontalalignment='center',
verticalalignment='center',
fontsize=12,
weight='bold')
plt.show()
UMAP embeddings found.
In [151]:
import plotly.express as px
sns.set(style="whitegrid")
# Score and assign cell types
valid_genes = {cell_type: [gene for gene in genes if gene in combined_adata.var_names]
for cell_type, genes in marker_genes.items()}
for cell_type, genes in valid_genes.items():
if genes:
sc.tl.score_genes(combined_adata, gene_list=genes, score_name=f'{cell_type}_score', use_raw=False)
score_cols = [f'{ct}_score' for ct in valid_genes.keys()]
combined_adata.obs['cell_type'] = combined_adata.obs[score_cols].idxmax(axis=1)
combined_adata.obs['cell_type'] = combined_adata.obs['cell_type'].str.replace('_score', '')
# Check UMAP
if 'X_umap' not in combined_adata.obsm:
print("UMAP embeddings not found.")
else:
print("UMAP embeddings found.")
# Run UMAP again if needed
if combined_adata.obsm['X_umap'].shape[1] < 3:
print("Warning: UMAP does not have 3 dimensions. Running 3D UMAP...")
sc.tl.umap(combined_adata, n_components=3)
# Create a DataFrame for plotting
umap_df = pd.DataFrame(
combined_adata.obsm['X_umap'],
columns=['UMAP1', 'UMAP2', 'UMAP3'],
index=combined_adata.obs.index
)
umap_df['cell_type'] = combined_adata.obs['cell_type'].values
# Interactive 3D plot
fig = px.scatter_3d(
umap_df,
x='UMAP1', y='UMAP2', z='UMAP3',
color='cell_type',
title='Interactive 3D UMAP with Cell Types',
opacity=0.7,
height=800,
width=900
)
fig.update_layout(
legend=dict(itemsizing='constant'),
margin=dict(l=0, r=0, b=0, t=40),
scene=dict(
xaxis_title='UMAP1',
yaxis_title='UMAP2',
zaxis_title='UMAP3'
)
)
fig.show()
UMAP embeddings found.
In [163]:
warnings.filterwarnings("ignore")
# Find marker genes per cell type using Wilcoxon test
sc.tl.rank_genes_groups(
combined_adata,
groupby='cell_type',
method='wilcoxon',
n_genes=20
)
# top 5 genes per cell type
result = combined_adata.uns['rank_genes_groups']
groups = result['names'].dtype.names # the different cell types
top_markers = []
for group in groups:
top_markers.extend(result['names'][group][:5]) # top 5 markers per cell type
top_markers = list(set(top_markers))
# 6. Plot a DotPlot for top 5 marker genes
import seaborn as sns
sns.set_context('notebook', font_scale=1.5)
sns.set_palette('coolwarm')
# Create the DotPlot
fig, ax = plt.subplots(figsize=(14, 6))
sc.pl.dotplot(
combined_adata,
var_names=top_markers,
groupby='cell_type',
standard_scale='var',
dot_max=0.7,
color_map='coolwarm',
ax=ax,
dendrogram=True
)
ax.set_title('Top 5 Marker Genes per Cell Type', fontsize=18)
plt.tight_layout()
plt.show()
<Figure size 640x480 with 0 Axes>
In [155]:
barcode_cluster_pk_dataframes
Out[155]:
{'GSM2230757': barcode assigned_cluster pk
human1_lib1.final_cell_0001 GATGACGGAC-GGTGGGAT acinar 1
human1_lib1.final_cell_0002 GAGCGTTGCT-ACCTTCTT acinar 0
human1_lib1.final_cell_0003 CTTACGGG-CCATTACT acinar 0
human1_lib1.final_cell_0004 GATGTACACG-TTAAACTG acinar 0
human1_lib1.final_cell_0005 GAGATTGCGA-GTCGTCGT acinar 1
... ... ... ..
human1_lib3.final_cell_0736 GAGAGAGTAT-GATTTACC endothelial 0
human1_lib3.final_cell_0737 TGATTCGCTGG-CTTCTGGA beta 0
human1_lib3.final_cell_0738 GCTTACCT-GGCATGCT endothelial 0
human1_lib3.final_cell_0739 CGGCACAT-TGGCCTGT beta 0
human1_lib3.final_cell_0740 TGCCTCAC-ACATCTAT quiescent_stellate 0
[1937 rows x 3 columns],
'GSM2230758': barcode assigned_cluster pk
human2_lib1.final_cell_0001 TCCAGGGA-CTGGTGCA delta 0
human2_lib1.final_cell_0002 GAAAGATTGT-TCCGTCCA delta 0
human2_lib1.final_cell_0003 AAATCAGA-AGTGATGC alpha 0
human2_lib1.final_cell_0004 ATAGTGGAC-GAGAGTAT delta 0
human2_lib1.final_cell_0005 GACTTACTCC-AAAGCCTA delta 0
... ... ... ..
human2_lib3.final_cell_0561 CATCGCAG-GCAACCTG activated_stellate 0
human2_lib3.final_cell_0562 GAGGACTTCC-CGGACAAC macrophage 0
human2_lib3.final_cell_0563 GTAACGTT-ATTCCTTG ductal 0
human2_lib3.final_cell_0564 GTGTAACC-TGTCTTTC activated_stellate 0
human2_lib3.final_cell_0565 GACCTACTAG-GCTTTGGC alpha 0
[1724 rows x 3 columns],
'GSM2230759': barcode assigned_cluster pk
human3_lib1.final_cell_0001 TGAAAACTGGT-ATGTTGGC acinar 0
human3_lib1.final_cell_0002 ACGGAATTT-GATCGTTT acinar 2
human3_lib1.final_cell_0003 TGAGGCGGTTT-AAAGCCTA acinar 2
human3_lib1.final_cell_0004 ACGTATAC-TGATAACA acinar 1
human3_lib1.final_cell_0005 AAATGAATG-GAAAGACC acinar 2
... ... ... ..
human3_lib4.final_cell_0895 TGACTAGTAAC-TTAAACTG alpha 0
human3_lib4.final_cell_0896 TGCTATTT-AAACTCGA beta 0
human3_lib4.final_cell_0897 GCTTACCT-GGCCTAAG alpha 0
human3_lib4.final_cell_0898 GACCCATAGC-CTCCGCAT beta 0
human3_lib4.final_cell_0899 GAACTGCCGT-TCTCACTT alpha 0
[3605 rows x 3 columns],
'GSM2230760': barcode assigned_cluster pk
human4_lib1.final_cell_0001 AATATCTTC-AGTGAAAG ductal 0
human4_lib1.final_cell_0002 AGGCAACG-GCATGGGT delta 0
human4_lib1.final_cell_0003 AACGCAGAG-TTGTCGCC delta 0
human4_lib1.final_cell_0004 CGGCTTAC-CGGGCTTT ductal 0
human4_lib1.final_cell_0005 AAGCTACGG-TGTAGTTT ductal 1
... ... ... ..
human4_lib3.final_cell_0697 GAGATCTCGG-GTCTCTCT activated_stellate 0
human4_lib3.final_cell_0698 GCTTACCT-ATGTTGGC alpha 0
human4_lib3.final_cell_0699 TGACACAGTTT-TTGTCGCC beta 0
human4_lib3.final_cell_0700 GACGACTCCT-CGCTAATA beta 0
human4_lib3.final_cell_0701 TGATGCCC-TTGCACGC ductal 0
[1303 rows x 3 columns]}
In [157]:
combined_adata
Out[157]:
AnnData object with n_obs × n_vars = 8499 × 13774
obs: 'batch', 'leiden', 'Alpha_score', 'Beta_score', 'Delta_score', 'PP_score', 'Acinar_score', 'Ductal_score', 'Endothelial_score', 'Macrophage_score', 'Stellate_score', 'cell_type'
var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'pca', 'batch_colors', 'neighbors', 'umap', 'leiden', 'leiden_colors', 'rank_genes_groups', 'dendrogram_cell_type'
obsm: 'X_pca', 'X_umap', 'X_umap_orig', 'X_pca_harmony', 'X_umap_harmony'
varm: 'PCs'
obsp: 'distances', 'connectivities'
In [159]:
combined_adata.write('combined_adata.h5ad')